Concrete Analysis

Sara Monica

June 27, 2022

Business Problem

This problem was originally proposed by Prof. I-Cheng Yeh of the Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan, in 2007. It builds on his 1998 research on predicting the compressive strength of concrete structures.

“Concrete is the most important material in civil engineering,” as Prof. I-Cheng Yeh put it.

Concrete compressive strength is determined not only by the water-cement mixture but also by the other ingredients and by how the mixture is treated. Using this dataset, we aim to find “the perfect recipe”: a model that predicts a concrete’s compressive strength and explains the relationship between the ingredient concentrations, the age at testing, and the compressive strength.

Data Processing

Load Libraries

library(dplyr)
library(caret)
library(tidyr)
library(randomForest)
library(ggplot2)
library(lime)
library(GGally)
library(performance) 
library(MLmetrics)
library(lmtest)
library(car)

Read Data

#  read data
train <- read.csv("data/data-train.csv")

The observation data consists of the following variables:

  • id: ID of each cement mixture,
  • cement: the amount of cement (kg) in a m3 mixture,
  • slag: the amount of blast furnace slag (kg) in a m3 mixture,
  • flyash: the amount of fly ash (kg) in a m3 mixture,
  • water: the amount of water (kg) in a m3 mixture,
  • super_plast: the amount of superplasticizer (kg) in a m3 mixture,
  • coarse_agg: the amount of coarse aggregate (kg) in a m3 mixture,
  • fine_agg: the amount of fine aggregate (kg) in a m3 mixture,
  • age: the number of resting days before the compressive strength measurement,
  • strength: concrete compressive strength in MPa.

To ensure the data is fully prepared, we demonstrate several data transformations, scaling, outlier handling, and related statistical strategies. It is best practice to preprocess data before performing analysis: the data must first be cleaned and transformed before it can be used for analysis and modeling.

Pre-processing

# data structure
glimpse(train)
## Rows: 825
## Columns: 10
## $ id          <chr> "S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9", "S10…
## $ cement      <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 380.0, 380.0, 475.0, 19…
## $ slag        <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 95.0, 95.0, 0.0, 132.4, 132…
## $ flyash      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ water       <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 192, 192, 228, 228…
## $ super_plast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ coarse_agg  <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932.0, …
## $ fine_agg    <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 594.0, 594.0, 594.0, 82…
## $ age         <int> 28, 28, 270, 365, 360, 365, 28, 28, 90, 28, 28, 90, 90, 36…
## $ strength    <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 43.70, 36.45, 39.29, 38…
#  check missing value
colSums(is.na(train))
##          id      cement        slag      flyash       water super_plast 
##           0           0           0           0           0           0 
##  coarse_agg    fine_agg         age    strength 
##           0           0           0           0
# remove duplicate rows (assign the result; unique() alone only prints it)
train <- unique(train)
# remove row containing NA value 
train <- train %>% filter(complete.cases(.))

Check data distribution of each predictor

train %>% 
   select_if(is.numeric) %>% 
   boxplot(main = 'Distribution of Each Predictor', xlab = 'Predictor', ylab = 'Values')

Our data can be examined visually to identify whether any outliers are present. Outliers affect the model we are developing by forcing it to accommodate them; as the name suggests, they lie outside the majority of the data, and including them can reduce the model's resolving power. From the boxplot we can see that some variables, such as age, super_plast, and slag, have noticeable outliers.
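Boxplots flag outliers with the usual 1.5 × IQR rule. A minimal sketch of that rule, applied to a synthetic vector mimicking the skewed age column (not the actual dataset):

```r
# Flag values beyond 1.5 * IQR from the quartiles (Tukey's rule,
# the same rule boxplot() uses for its whiskers)
iqr_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

age_like <- c(rep(28, 50), 365)  # synthetic: mostly 28-day tests, one extreme
iqr_outliers(age_like)           # the lone 365-day value is flagged
```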

Distribution on Each Predictor

train %>% 
    select_if(is.numeric) %>% 
    pivot_longer(cols = -strength, names_to = 'predictor') %>% 
    ggplot(aes(x = value)) +
    geom_density() +
    facet_wrap(~predictor, scales = 'free_x')  +
    labs(
        title = 'Density graph of each variable',
        x = 'Value',
        y = 'Density'
    )

The graph shows that flyash, slag, coarse_agg, fine_agg, and cement are all fairly uniform in shape. This may imply that these fundamental components of concrete are combined freely rather than following a prescribed dosage. Most recipes use between 150 and 200 kg of water per m3. super_plast is either not used at all or used in amounts around 10 kg. Most testing ages cluster around 7, 30, 60, and 90–100 days.

Data Transformation

Let’s see the trend of our data for each predictor

train %>% 
    select_if(is.numeric) %>% 
    pivot_longer(cols = -strength, names_to = 'predictor') %>% 
    ggplot(aes(x = value, y = strength)) +
    geom_point() +
    geom_smooth(method = 'loess', formula = 'y ~ x') +
    facet_wrap(~predictor, scales = 'free_x')  +
    labs(
        title = 'Trends in each variable',
        x = 'Value',
        y = 'Strength'
    )

According to the plots, cement and super_plast show a clear positive correlation with strength, while coarse_agg, fine_agg, flyash, and slag show a minor negative one. Age and water show no simple linear relationship: water has a cyclical pattern, and age follows a curve that rises steeply and then flattens. Regression models perform best with linear data, so we can try to make the non-linear relationships more linear by transforming those variables.

Age as Log(Age)

train %>% 
    select_if(is.numeric) %>% 
    select(age, strength) %>% 
    ggplot(aes(x = log(age), y = strength)) +
    geom_point() +
    geom_smooth(method = 'loess', formula = 'y ~ x') + 
    labs(
        title = 'Correlation between log(age) and strength',
        x = 'log(age)',
        y = 'strength'
    )

We see that the relationship is now more linear. We will persist this change to our dataset by transforming age to log(age).

Transform Age to Log(Age)

train_log <- train %>% mutate(age = log(age)) 
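As a quick numeric illustration of why the log transform helps, a sketch on synthetic data (the coefficients and seed are arbitrary, not fitted from the real dataset):

```r
# Synthetic strength that grows with log(age), as the scatter plot suggests
set.seed(1)
age <- sample(c(3, 7, 28, 90, 365), size = 200, replace = TRUE)
strength <- 10 + 8 * log(age) + rnorm(200, sd = 2)

cor(age, strength)       # linear correlation is weakened by the long tail
cor(log(age), strength)  # noticeably stronger after the transform
```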

Water as Log(Water)

train %>% 
    select_if(is.numeric) %>% 
    select(water, strength) %>% 
    ggplot(aes(x = log(water), y = strength)) +
    geom_point() +
    geom_smooth(method = 'loess', formula = 'y ~ x') + 
    labs(
        title = 'Correlation between log(water) and strength',
        x = 'log(water)',
        y = 'strength'
    )

Since the shape is still cyclical, this transform has no effect.

Data Scaling

train_log %>%
    select_if(is.numeric) %>% 
    pivot_longer(cols = -strength, names_to = 'predictor') %>% 
    group_by(predictor) %>% 
    summarize(value = max(value)) %>% 
    ggplot(aes(x = predictor, y = value)) +
    geom_col(fill = 'pink') + 
    labs(
        title = 'Data Range Before Scaling',
        x = 'Variable',
        y = 'Value'
    ) + theme_minimal()

Before we scale train_log, we need to remove the non-numeric column id.

# data scaling
train_scale <- train_log %>% select(-id) %>% as.data.frame()
train_scale[,-9] <- scale(train_scale[,-9])
train_scale %>%
    pivot_longer(cols = -strength, names_to = 'predictor') %>% 
    group_by(predictor) %>% 
    summarize(value = max(value)) %>% 
    ggplot(aes(x = predictor, y = value)) +
    geom_col(fill = 'pink') + 
    labs(
        title = 'Data Range After Scaling',
        x = 'Variable',
        y = 'Values'
    ) + theme_minimal()

Researchers scale the data so that each variable's impact is depicted accurately. By standardising the data we give each variable equal weight, which lets us interpret the model's coefficients appropriately.
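For reference, scale() performs z-score standardisation; a toy check on an arbitrary vector (not the concrete data):

```r
# scale() centres by the mean and divides by the standard deviation
x <- c(100, 200, 300, 400)
z <- as.numeric(scale(x))
manual <- (x - mean(x)) / sd(x)

all.equal(z, manual)  # TRUE: identical up to numerical tolerance
```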

Exploratory Data Analysis

Correlation

# check correlation
ggcorr(train_scale, hjust = 1, label = TRUE)

The stronger the correlation (the nearer it is to 1 or -1), the more closely related the variables are. The correlation matrix above shows the correlation between each pair of variables. In our dataset, super_plast and water have the strongest negative correlation (-0.6), and strength and age have the strongest positive correlation (0.6).

age, cement, and super_plast have the most significant positive relationships with strength, indicating that these variables contribute positively and substantially to it. Water, on the other hand, shows the strongest negative correlation.
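The same ranking can be pulled out numerically with cor(); a sketch using a built-in dataset for illustration (swap in train_scale's numeric columns in practice):

```r
# List the n strongest pairwise correlations in a numeric data frame
strongest_cors <- function(df, n = 3) {
  cm <- cor(df)
  cm[upper.tri(cm, diag = TRUE)] <- NA           # keep each pair only once
  idx <- order(abs(cm), decreasing = TRUE, na.last = NA)[seq_len(n)]
  data.frame(var1 = rownames(cm)[row(cm)[idx]],
             var2 = colnames(cm)[col(cm)[idx]],
             cor  = cm[idx])
}

strongest_cors(mtcars[, c("mpg", "wt", "hp", "disp")])
```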

Handling Outliers

Find outlier value

# Check the outlier after scaling
boxplot.stats(train_scale$strength)$out
boxplot((train_scale$strength))

Remove the outlier after scaling

# remove the outlier after scaling
# train_clean <- train_scale[train_scale$strength < 2.652730 ,]
train_clean <- train_scale[train_scale$strength < 79.40,]
# train_clean <- train_scale[train_scale$strength > -2.472288,]
boxplot(train_clean$strength)

Modeling with one predictor

model_ols <- lm(formula = strength ~ cement, data = train_scale)
model_ols_no_out <- lm(formula = strength ~ cement, data = train_clean)

Plot the difference between the two fits

plot(strength ~ cement, data=train_scale)
abline(model_ols, col = "red")
abline(model_ols_no_out, col = "green")

High leverage, low influence: because the graph shows that the outliers of the strength variable have high leverage but low influence, we compare the models using R-squared.

R-squared

summary(model_ols)$r.squared
## [1] 0.246024
summary(model_ols_no_out)$r.squared
## [1] 0.242262

Since train_scale gives a slightly better R-squared, we decided not to use train_clean.

Data Distribution of Each Predictor

train_scale  %>% as_tibble() %>%
    pivot_longer(cols = -c(strength), names_to='Predictor') %>% 
    ggplot(aes(x = Predictor, y = value)) +
    geom_jitter(col = 'blue', alpha = 0.2) +
    labs(
        title = 'Data Distribution of Each Predictor',
        x = 'Predictor',
        y = 'Values'
    ) + theme_minimal()

Model Fitting and Evaluation

Data Splitting

We now split the data into training and validation sets. The training set is used to train the model, which is then checked against the validation set.

library(rsample)
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)

index <- sample(nrow(train_scale), nrow(train_scale)*0.8)

data_train <- as.data.frame(train_scale[index,])
data_validation <- as.data.frame(train_scale[-index,])

Check the Data Split

set.seed(123)
control <- trainControl(method = "cv", number = 10)

ca_model <- train(strength ~ ., data = data_train, method = "lm", trControl = control)

ca_model
## Linear Regression 
## 
## 660 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 595, 595, 593, 593, 595, 593, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   6.949791  0.8291556  5.431038
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Model Fitting

Model with No Predictor

# model with no predictors (intercept only)
model_none <- lm(formula = strength ~ 1, data = data_train)

Model with All Predictors

# model with all predictors
model_all <- lm(strength ~ ., data_train)

Variable Selection: Step-Wise Regression Model

We have built model_none, which uses no predictors, and model_all, which uses all of them. Stepwise regression is a method for picking an optimal model using the Akaike Information Criterion (AIC) as its metric: it optimizes for the lowest AIC, meaning the least information loss. Let's pick the important variables using stepwise regression. Note that it uses a greedy algorithm that finds a local minimum, so it does not guarantee the best model.
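As a reminder of what step() is minimizing: AIC = 2k − 2·log-likelihood, where k counts the estimated parameters (for lm, the coefficients plus the residual variance). A toy verification on a built-in dataset:

```r
# Reproduce AIC() by hand for a simple lm fit
fit <- lm(mpg ~ wt, data = mtcars)

k <- length(coef(fit)) + 1              # +1 for the residual variance sigma^2
manual_aic <- 2 * k - 2 * as.numeric(logLik(fit))

all.equal(manual_aic, AIC(fit))         # TRUE
```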

1. Backward

# stepwise regression: backward elimination
model_backward <- step(object = model_all,
                       direction = "backward",
                       trace = FALSE) # suppress the step-by-step output

2. Forward

model_forward <- step(
  object = model_none, # lower bound of the search
  direction = "forward",
  scope = list(upper = model_all), # upper bound of the search
  trace = FALSE) # suppress the step-by-step output

3. Both

model_both <- step(
  object = model_none, # lower bound of the search
  direction = "both",
  scope = list(upper = model_all), # upper bound of the search
  trace = FALSE
)

Model Evaluation

We now compare the candidate models (no predictors, all predictors, and the stepwise selections) on AIC, BIC, R-squared, and RMSE using compare_performance().

comparison <- compare_performance(model_none, model_all, model_forward, model_both)
as.data.frame(comparison)
##            Name Model      AIC        AIC_wt      BIC        BIC_wt        R2
## 1    model_none    lm 5587.899 1.179650e-250 5596.883 2.674447e-243 0.0000000
## 2     model_all    lm 4436.943  9.971489e-01 4481.865  3.551574e-01 0.8293458
## 3 model_forward    lm 4451.224  7.900568e-04 4487.162  2.513473e-02 0.8245528
## 4    model_both    lm 4449.306  2.061075e-03 4480.752  6.197079e-01 0.8245309
##   R2_adjusted      RMSE     Sigma
## 1   0.0000000 16.631325 16.643938
## 2   0.8272486  6.870453  6.917782
## 3   0.8229407  6.966266  7.003505
## 4   0.8231894  6.966700  6.998585

Evaluation Function

eval_recap <- function(truth, estimate){
  
  data.frame(RMSE = RMSE(estimate, truth),
             MAE = MAE(estimate, truth),
             "R-Square" = R2_Score(estimate, truth),
             MAPE = MAPE(estimate, truth),
             check.names = F
             ) %>% 
    mutate(MSE = RMSE^2) # mean squared error is the RMSE squared
}

Model None - Evaluation

# Model None - Evaluation
pred_none_val <- predict(model_none, data_validation)

eval_recap(truth = data_validation$strength,
           estimate = pred_none_val)
##       RMSE      MAE    R-Square      MAPE      MSE
## 1 17.01379 13.48017 -0.01144447 0.5953731 4.124777

Model All - Evaluation

pred_all_val <- predict(object = model_all, newdata = data_validation)

eval_recap(truth = data_validation$strength,
           estimate = pred_all_val)
##       RMSE      MAE  R-Square      MAPE      MSE
## 1 7.905688 5.945323 0.7816166 0.1917857 2.811706

Model Step-Wise Both - Evaluation

pred_both_val <- predict(object = model_both, newdata = data_validation)

eval_recap(truth = data_validation$strength,
           estimate = pred_both_val)
##       RMSE     MAE  R-Square      MAPE     MSE
## 1 8.005901 6.05212 0.7760451 0.1982521 2.82947

As shown above, model_all has the best evaluation scores. Next, we check the model assumptions.

Checking Assumptions

Linear models are built on four assumptions. Before we carry on, we have to check whether these assumptions hold for our model.

Assumption of linearity

The linearity assumption states that there is a linear relationship between the predictors and the target variable, so that our model can describe it correctly. A visual way to evaluate this is to plot the model's residuals.

Visualization of the residual distribution using a histogram

# residual histogram
ggplot(data = as.data.frame(model_all$residuals), aes(x = model_all$residuals)) +
  geom_histogram(fill = "#CC0000", color = "orange", bins = 30) +
  labs( title = "Regression Residual Distribution", subtitle = "Log Transformation", x = "residual")

Statistical Test with `shapiro.test()`

Shapiro-Wilk hypothesis test:

  • H0: Residuals are normally distributed
  • H1: Residuals are not normally distributed

# Shapiro-Wilk test on the residuals
shapiro.test(model_all$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_all$residuals
## W = 0.99677, p-value = 0.2076
check_normality(model_all)
## OK: residuals appear as normally distributed (p = 0.222).

Based on the result, the residuals are normally distributed.

VIF: Independence of Variables

Multicollinearity is a condition in which predictors are strongly correlated with one another. It is undesirable because it indicates redundant predictors in the model; ideally we would keep only one variable from each strongly related group. We hope multicollinearity does not occur.

Test the VIF (Variance Inflation Factor) with the vif() function from the car package:

  • VIF > 10: multicollinearity occurs in the model
  • VIF < 10: there is no multicollinearity in the model
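VIF itself comes from regressing each predictor on the others: VIF_j = 1 / (1 − R²_j). A toy check on a built-in dataset (not the concrete data):

```r
# VIF for `wt` when the other predictors are `hp` and `disp`
r2_wt  <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
vif_wt  # well above 1: wt is partly explained by hp and disp
```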

vif(model_all)
##      cement        slag      flyash       water super_plast  coarse_agg 
##    7.779963    6.785129    6.297809    6.777976    3.072010    4.873030 
##    fine_agg         age 
##    7.150553    1.030438

Since every VIF value is below 10, there is no multicollinearity in the model.

Homoscedasticity

The homoscedasticity assumption states that the error term in the relationship between the predictors and the target variable is constant across all input values. This assumption can be checked using the Breusch-Pagan test, with hypotheses:

  • H0: The error variance is the same across all inputs (homoscedastic)
  • H1: The error variance is not the same across all inputs (heteroscedastic)

plot(x = model_all$fitted.values, y = model_all$residuals,
     xlab = 'Prediction', ylab = 'Residuals')
abline(h = 0, col = "#FF0000")

We can test the homoscedasticity of the model using the Breusch-Pagan test.

bptest(model_all)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_all
## BP = 74.093, df = 8, p-value = 7.492e-13

Based on the result, the error variance is not the same across all inputs, so the homoscedasticity assumption is violated.
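For intuition, the classic Breusch-Pagan statistic is n·R² from regressing the squared residuals on the predictors (bptest() defaults to the studentized Koenker variant, so its numbers differ slightly). A toy sketch on a built-in dataset:

```r
# Classic Breusch-Pagan statistic by hand on a simple lm fit
fit <- lm(mpg ~ wt, data = mtcars)
aux <- lm(resid(fit)^2 ~ wt, data = mtcars)   # auxiliary regression

bp_stat <- nrow(mtcars) * summary(aux)$r.squared
p_value <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p.value = p_value)
```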

Even though our linear model fails this test, we can still try to draw conclusions from it. Our model's mean absolute percentage error (MAPE) is a decent 0.198.

coef_all <- model_all$coefficients[-1]
barplot(coef_all, main = 'Influence of `model_all` Predictors', ylab = 'Coefficient value')

Model Interpretation and Improvement Ideas

We should not transform data_train any further, because we already applied scaling, log-transformed several variables, and handled outliers at the beginning. We also deliberately left the target variable unscaled, because we do not want to have to scale the test results back at the end of our research.

One-Way ANOVA

anova_train <- aov(formula = strength ~ ., data = data_train)
summary(anova_train)
##              Df Sum Sq Mean Sq  F value   Pr(>F)    
## cement        1  45198   45198  944.470  < 2e-16 ***
## slag          1  13610   13610  284.391  < 2e-16 ***
## flyash        1  14776   14776  308.765  < 2e-16 ***
## water         1   8018    8018  167.546  < 2e-16 ***
## super_plast   1    921     921   19.240 1.34e-05 ***
## coarse_agg    1    487     487   10.171  0.00149 ** 
## fine_agg      1    316     316    6.599  0.01042 *  
## age           1  68077   68077 1422.548  < 2e-16 ***
## Residuals   651  31154      48                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Polynomial Model

model_polym <- lm(strength ~ polym(cement , slag, flyash, water, super_plast, coarse_agg, fine_agg, age, degree = 2, raw = T), data_train )


pred_polym_val <- predict(object = model_polym, newdata = data_validation)

eval_recap(truth = data_validation$strength,
           estimate = pred_polym_val)
##      RMSE      MAE  R-Square      MAPE      MSE
## 1 6.36655 4.805496 0.8583722 0.1431404 2.523202

Checking Assumptions of the Polynomial Model

Residuals Autocorrelation

We will check whether the residuals correlate with themselves using the Durbin-Watson test.

  • H0: Residuals are not autocorrelated
  • H1: Residuals are autocorrelated (reject H0 when the p-value < 0.05)

dwtest(model_polym)
## 
##  Durbin-Watson test
## 
## data:  model_polym
## DW = 2.0129, p-value = 0.5629
## alternative hypothesis: true autocorrelation is greater than 0

Based on the result, the residuals are not autocorrelated.

VIF: Independence of Variables

Because the polynomial terms in our model are by construction no longer independent of one another, VIF is not meaningful here. This will become clearer as we examine the model in more detail in the following subsection.

Statistical Test with `shapiro.test()`

Shapiro-Wilk hypothesis test:

  • H0: Residuals are normally distributed
  • H1: Residuals are not normally distributed

# Shapiro-Wilk test on the residuals
shapiro.test(model_polym$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_polym$residuals
## W = 0.99168, p-value = 0.0009204
check_normality(model_polym)
## Warning: Non-normality of residuals detected (p < .001).
#  histogram residual
ggplot(data = as.data.frame(model_polym$residuals), aes(x = model_polym$residuals)) +
  geom_histogram(fill = "pink", color = "black", bins = 30) +
  labs( title = "Regression Residual Distribution", subtitle = "Log Transformation", x = "residual")

Based on the result, the residuals are not normally distributed.

Homoscedasticity

The homoscedasticity assumption states that the error term in the relationship between the predictors and the target variable is constant across all input values. This assumption can be checked using the Breusch-Pagan test, with hypotheses:

  • H0: The error variance is the same across all inputs (homoscedastic)
  • H1: The error variance is not the same across all inputs (heteroscedastic)

We can test the homoscedasticity of the model using the Breusch-Pagan test.

bptest(model_polym)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_polym
## BP = 153.77, df = 44, p-value = 4.359e-14
plot(x = model_polym$fitted.values, y = model_polym$residuals,
     xlab = 'Prediction', ylab = 'Residuals')
abline(h = 0, col = "red")

Based on the result, the error variance is not the same across all inputs, so the homoscedasticity assumption is violated.

Even though this model also fails the tests, we can still try to draw conclusions from it. Its mean absolute percentage error (MAPE) is a decent 0.143.

Random Forest

Create a random forest model as model_rf

set.seed(123)
model_rf <- randomForest(x = data_train %>% select(-strength),
                         y = data_train$strength, 
                         ntree = 500)

model_rf
## 
## Call:
##  randomForest(x = data_train %>% select(-strength), y = data_train$strength,      ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 34.47698
##                     % Var explained: 87.54

Check each predictor's contribution to the target variable

varImp(model_rf)
##               Overall
## cement      36292.387
## slag         9936.825
## flyash       8554.180
## water       22655.250
## super_plast 17046.985
## coarse_agg  10924.007
## fine_agg    12855.460
## age         52901.006

Model Random Forest - Evaluation

pred_rf_val <- predict(object = model_rf, newdata = data_validation)


eval_recap(truth = data_validation$strength,
           estimate = pred_rf_val)
##       RMSE      MAE  R-Square      MAPE      MSE
## 1 6.055747 4.368295 0.8718627 0.1632309 2.460843

Random Forest Variable Importance on the Target Variable

library("tibble")
model_rf$importance %>% 
  as.data.frame() %>% 
  arrange(-IncNodePurity) %>% 
  rownames_to_column("variable") %>% 
  head(10) %>% 
  ggplot(aes(IncNodePurity, 
             reorder(variable, IncNodePurity))
         ) +
  geom_col(fill = "firebrick") +
  labs(x = "Importance",
       y = NULL,
       title = "Random Forest Variable Importance")

The plot above shows how large an influence each predictor has; the top three predictors associated with strength are age, cement, and water.
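One way to probe such influences further is a crude partial-dependence check: vary one predictor over a grid while holding the others fixed, then predict. A sketch with a plain lm on a built-in dataset (the same idea applies to model_rf with its own predictors):

```r
# Vary `wt` over a grid, hold `hp` at its median, and inspect predictions
fit  <- lm(mpg ~ wt + hp, data = mtcars)
grid <- data.frame(
  wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 5),
  hp = median(mtcars$hp)
)
predict(fit, grid)  # predicted mpg falls steadily as wt rises
```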

Lime Interpretation

library(lime)

set.seed(123)
explainer <- lime(x = data_train %>% select(-strength),
                  model = model_rf)

model_type.randomForest <- function(x){
  return("regression") # for regression problem
}

predict_model.randomForest <- function(x, newdata, type = "response") {

    # return prediction value
    predict(x, newdata) %>% as.data.frame()
    
}

#  Select only the first 4 observations
selected_data <- data_validation %>% 
  select(-strength) %>% 
  slice(1:4)

#  Explain the model
set.seed(123)
explanation <- explain(x = selected_data, 
                       explainer = explainer,
                       kernel_width = 1,
                       dist_fun = "manhattan",
                       n_features = 8 #  Number of features to explain the model
                       )

Since we have been using scaled data from the beginning, we also use scaled data to visualize model_rf.

Random Forest Visualization and Interpretation

plot_features(explanation = explanation)

Explanation Fit indicates how well LIME explains the model, somewhat like the \(R^2\) (R-squared) value of a linear regression. Here the Explanation Fit values are only around 0.50-0.70 (50%-70%), which can be interpreted as LIME explaining our model only partially. Almost all cases reach the >= 0.50 standard; only Case 4 falls below it. We can also note that Case 3 has the largest Explanation Fit, while Case 1 has the largest prediction.

Support Vector Machine

library(e1071)
model_svm <- svm(strength ~ ., data = data_train)
pred_svm_val <- predict(object = model_svm, newdata = data_validation)


eval_recap(truth = data_validation$strength,
           estimate = pred_svm_val)
##       RMSE      MAE  R-Square      MAPE      MSE
## 1 5.544694 3.793846 0.8925775 0.1214288 2.354717

The SVR model outperforms every model we built before. However, we will still use both models for further analysis, as comparison and as examples.

Lime Interpretation

# create the explanation for the SVR model.
set.seed(123)
explainer_svm <- lime(x = data_train %>% select(-strength), 
                  model = model_svm)

# Create SVR model specification for lime.
model_type.svm <- function(x){
  return("regression") # for regression problem
}

predict_model.svm <- function(x, newdata, type = "response") {

    # return prediction value
    predict(x, newdata) %>% as.data.frame()
    
}

SVM Visualization and Interpretation

set.seed(123)
explanation_svm <- explain(x = selected_data, 
                       explainer = explainer_svm,
                       kernel_width = 1,
                       dist_fun = "manhattan",
                       feature_select = "auto", # Method of feature selection for lime
                       n_features = 10 # Number of features to explain the model
                       )

plot_features(explanation_svm)

Explanation Fit indicates how well LIME explains the model, somewhat like the \(R^2\) (R-squared) value of a linear regression. For the SVR model the Explanation Fit values are again only around 0.50-0.70 (50%-70%), meaning LIME explains only part of the model. Almost all cases reach the >= 0.50 standard; only Case 4 falls below it. Case 3 has the largest Explanation Fit, while Case 1 has the largest prediction.

From Case 3 we get the insight that the three predictors with the biggest influence on strength are age, cement, and water. In Case 4, cement, water, and coarse_agg dominate in controlling the concrete strength.

Finding a Better Material Composition

Linear Model

# gather the top 20 mixtures by strength
top_mix <- train_scale %>% arrange(-strength) %>% head(20)
influence <- top_mix %>% pivot_longer(cols = -c(strength), names_to='Predictor')

# overlay the top mixtures on the training data
data_train %>% 
    pivot_longer(cols = -c(strength), names_to='Predictor') %>% 
    ggplot(aes(x = Predictor, y = value)) +
    geom_jitter(col = 'red', alpha = 0.2) + 
    geom_jitter(data = influence, aes(x = Predictor, y = value), col = 'blue')  + 
    labs(
        title = 'Comparing Top Mixtures vs. Training Data',
        x = 'Predictor',
        y = 'Values'
    )

Clustering Age into Three Classes

train_class <- train %>% select(-id) %>% mutate(
  age = case_when(age < 40 ~ "< 40",
                  age >= 40 & age <= 80 ~ "40 - 80",
                  age > 80 ~ "> 80")
)

Age Class on Strength

ggplot(data = train_class, aes(x = age, y = strength)) +
  geom_boxplot() +
  geom_jitter(aes(color = age), show.legend = F) +
  labs(x = "Age",
       y = "Strength") +
  theme_minimal()

Based on the statistical summary and the visualization of the strength distribution for each age class above, curing time clearly affects construction strength. The class rested less than 40 days has the lowest median strength (around 30), while the 40-80 day class performs best, with a median strength of around 50.

ANOVA Model

anova_class <- aov(formula = strength ~ age, data = train_class)
summary(anova_class)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## age           2  49007   24504   111.2 <2e-16 ***
## Residuals   822 181204     220                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is less than the significance level (alpha = 0.05), we can conclude that there is a significant difference in strength between the age classes.

Pairwise Mean Comparison

TukeyHSD(anova_class)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = strength ~ age, data = train_class)
## 
## $age
##                   diff        lwr       upr     p adj
## > 80-< 40    15.850947 12.6941891 19.007704 0.0000000
## 40 - 80-< 40 20.040109 15.6653237 24.414894 0.0000000
## 40 - 80-> 80  4.189162 -0.8168007  9.195125 0.1216492

Two of the three p-values fall below the 0.05 significance level; only the comparison between the 40-80 and > 80 day classes does not. We can conclude that concrete strength is significantly better after 40 days.

ANOVA Assumptions

Homogeneity of Variance

Hypothesis:

  • H0: The categories have homogeneous variance.
  • H1: The categories have heterogeneous variance.

leveneTest(strength ~ age, train_class)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   2  3.5031 0.03056 *
##       822                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is less than the significance level (alpha = 0.05), so we reject H0 and conclude that the variance between categories is heterogeneous; the ANOVA result should therefore be interpreted with some caution.

Normality of Residuals

Hypothesis:

  • H0: Residuals are normally distributed
  • H1: Residuals are not normally distributed

shapiro.test(anova_class$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  anova_class$residuals
## W = 0.97014, p-value = 5.912e-12

Based on the result, the residuals are not normally distributed.

We can conclude that the best curing period for concrete is around 40-80 days. Setting the period to more than 80 days is still acceptable, because the median of that distribution is not far from the 40-80 day period.

Composing Best Mixture Based on Random Forest and Linear Model Interpretation

# mix and match based on how strongly each predictor influences the target variable
comp_1 <- top_mix %>% select(-strength) %>% mutate_all(mean)  %>% head(1)
comp_2 <- top_mix %>% select(-strength) %>% 
  mutate(cement = mean(top_mix$cement),
         water = min(top_mix$water),
         coarse_agg = mean(top_mix$coarse_agg)) %>% head(1)
comp_3 <- top_mix %>% select(-strength) %>% 
  mutate(cement = mean(top_mix$cement),
         slag = max(top_mix$slag),
         water = min(top_mix$water)) %>% head(1)
comp_4 <- top_mix  %>% select(-strength) %>% 
  mutate(cement = mean(top_mix$cement),
         flyash = min(top_mix$flyash),
         super_plast = mean(top_mix$super_plast),
         coarse_agg = mean(top_mix$coarse_agg)) %>% head(1)
# merge the four new compositions
composition <- bind_rows(comp_1, comp_2, comp_3, comp_4)
# predict the new compositions with the SVM model
new_comp <- predict(model_svm, composition)
composition <- composition %>% mutate(strength = new_comp)
new_formula <- composition %>% mutate(formula = c('C1', 'C2', 'C3','C4'))
new_formula
##     cement      slag     flyash      water super_plast coarse_agg   fine_agg
## 1 1.089695 0.7572579 -0.6391721 -0.9236385   0.8343025 -0.1917599 -0.3567070
## 2 1.089695 1.3450618 -0.8457888 -1.7208632   2.6336581 -0.1917599 -0.2447674
## 3 1.089695 2.4343832 -0.8457888 -1.7208632   2.6336581 -0.3666832 -0.2447674
## 4 1.089695 1.3450618 -0.8457888 -1.6641549   0.8343025 -0.1917599 -0.2447674
##         age strength formula
## 1 0.7417364 75.05045      C1
## 2 1.1320080 78.12443      C2
## 3 1.1320080 76.77664      C3
## 4 1.1320080 81.63057      C4
train_scale %>% arrange(-strength) %>% head(1) %>%
  mutate(id = "C4",
         cement = new_formula[4,'cement'][[1]],
         flyash = new_formula[4,'flyash'][[1]],
         super_plast = new_formula[4,'super_plast'][[1]],
         coarse_agg = new_formula[4,'coarse_agg'][[1]],
         strength = new_formula[4,'strength'][[1]]
                   )
##     cement     slag     flyash     water super_plast coarse_agg   fine_agg
## 1 1.089695 1.345062 -0.8457888 -1.664155   0.8343025 -0.1917599 -0.2447674
##        age strength id
## 1 1.132008 81.63057 C4
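The compositions above are expressed in z-scores, because the predictors were standardized with `scale()`. To report an actual recipe in kg per m3 of mixture, the transformation can be inverted using the centers and scales that `scale()` stores as attributes. A minimal sketch with hypothetical numbers standing in for the raw training predictors:

```r
# scale() stores each column's center and scale as attributes, which
# lets us convert a standardized recipe back to original units.
# `raw` is a small hypothetical stand-in for the raw training predictors.
raw <- data.frame(cement = c(300, 350, 400), water = c(160, 180, 200))
scaled  <- scale(raw)
centers <- attr(scaled, "scaled:center")
sds     <- attr(scaled, "scaled:scale")

# invert the z-transformation for one standardized row
original <- scaled[2, ] * sds + centers
original
```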

Conclusion

In this research project, we examined various concrete formulations with different strengths. We developed a model that fits the available data, and, using that model as a framework, we composed new formulations and predicted their strength.

Throughout this project, we employed a Support Vector Machine (Regression) model. Compared to a standard regression, the model describes the data better, and, despite being more complicated, it remains a model that can be understood. The prediction model `model_svm` obtained an MAE of 3.79 and an R-squared of 89 percent on the validation dataset, and an MAE of 5.62 and an R-squared of 81 percent on the test dataset.

The methodology adopted in this project can be applied to other problems where we want to optimize an outcome. It can address optimization questions such as improving a food product's flavor, consistency, and texture, or determining the ideal chemical mixture to produce a specific flavor, aroma, or fragrance. Regression models offer a simple approach to gaining insight and identifying opportunities when applied to the relevant problem.

Submission

test <- read.csv("data/data-test.csv")
# scale numeric predictors (ideally with the training set's centers and scales)
test[,-c(1,10)] <-  scale(test[,-c(1,10)])
test_clean <-  test %>% select(-c(id, strength))
comparison <- compare_performance(model_none, model_all, model_both, model_polym)
as.data.frame(comparison)
##          Name Model      AIC        AIC_wt      BIC        BIC_wt        R2
## 1  model_none    lm 5587.899 4.739201e-305 5596.883 3.951068e-262 0.0000000
## 2   model_all    lm 4436.943  4.006008e-55 4481.865  5.246883e-20 0.8293458
## 3  model_both    lm 4449.306  8.280290e-58 4480.752  9.155193e-20 0.8245309
## 4 model_polym    lm 4186.434  1.000000e+00 4393.077  1.000000e+00 0.8953114
##   R2_adjusted      RMSE     Sigma
## 1   0.0000000 16.631325 16.643938
## 2   0.8272486  6.870453  6.917782
## 3   0.8231894  6.966700  6.998585
## 4   0.8878215  5.381167  5.574563
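Note that the chunk above standardizes the test set with its own statistics. To keep the test features on the same footing the model saw during training, the training set's centers and scales can be reused via the `center` and `scale` arguments of `scale()`. A minimal sketch with hypothetical data frames:

```r
# Scale a test set with the TRAINING set's centers and scales, so that
# identical raw values map to identical z-scores in both sets.
train_raw <- data.frame(cement = c(300, 350, 400), water = c(160, 180, 200))
test_raw  <- data.frame(cement = c(320, 390),      water = c(170, 210))

train_scaled <- scale(train_raw)
test_scaled  <- scale(test_raw,
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))
test_scaled
```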
#  predict target using your model
pred_test <- predict(object = model_svm, newdata = test_clean)

#  Create submission data
submission <- data.frame(id = test$id,
                         strength = pred_test
                         )

#  save data
write.csv(submission, "submission-sara.csv", row.names = F)

#  check first 3 data
head(submission, 3)
##     id strength
## 1 S826 40.99496
## 2 S827 30.51722
## 3 S828 41.74776
knitr::include_graphics("data/model-svm.png")